Abstract. We formulate an affine invariant implementation of the algorithm in [Nesterov, 1983]. We show
that the complexity bound is then proportional to an affine invariant regularity constant defined with respect to 
the Minkowski gauge of the feasible set. 



1. Introduction

In this short note, we show how to implement the smooth minimization algorithm described in [Nesterov,
1983, 2005] so that both its iterations and its complexity bound are invariant under a change of coordinates in
the problem. We focus on the minimization problem

$$
\begin{array}{ll}
\mbox{minimize} & f(x)\\
\mbox{subject to} & x \in Q,
\end{array}
\qquad (1)
$$

where $f$ is a convex function with Lipschitz continuous gradient and $Q$ is a compact convex set. Without
too much loss of generality, we will assume that the interior of $Q$ is nonempty and contains zero. When $Q$
is sufficiently simple, in a sense that will be made precise later, Nesterov [1983] showed that this problem
can be solved with a complexity of $O(1/\sqrt{\epsilon})$, where $\epsilon$ is the precision target. Furthermore, it can be shown
that this complexity bound is optimal for the class of smooth problems [Nesterov, 2003].

While the $O(1/\sqrt{\epsilon})$ dependence of the complexity bound in Nesterov [1983] is optimal, the constant in
front of that bound still depends on a few parameters which vary with implementation: the choice of norm
and of prox regularization function. This means in particular that, everything else being equal, this bound is not
invariant with respect to an affine change of coordinates, so the complexity bound varies while the intrinsic
complexity of problem (1) remains unchanged. Here, we show one possible fix for this inconsistency, by
choosing a norm and a prox term for the algorithm in [Nesterov, 1983, 2005] which make its iterations and
complexity invariant under a change of coordinates.



2. Smooth Optimization Algorithm 



We first recall the basic structure of the algorithm in [Nesterov, 1983]. While many variants of this
method have been derived, we use the formulation in [Nesterov, 2005]. We choose a norm $\|\cdot\|$ and assume
that the function $f$ in problem (1) is convex with Lipschitz continuous gradient, so

$$ f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} L \|y - x\|^2, \quad x, y \in Q, \qquad (2) $$

for some $L > 0$. We also choose a prox function $d(x)$ for the set $Q$, i.e. a continuous, strongly convex
function on $Q$ with parameter $\sigma$ (see Nesterov [2003] or Hiriart-Urruty and Lemaréchal [1993] for a
discussion of regularization techniques using strongly convex functions). We let $x_0$ be the center of $Q$ for the
prox function $d(x)$, so that

$$ x_0 \triangleq \mathop{\rm argmin}_{x \in Q} \; d(x). $$

Assuming w.l.o.g. that $d(x_0) = 0$, we then get in particular

$$ d(x) \geq \frac{1}{2} \sigma \|x - x_0\|^2. \qquad (3) $$
We write $T_Q(x)$ for a solution to the following subproblem

$$ T_Q(x) \triangleq \mathop{\rm argmin}_{y \in Q} \left\{ \langle \nabla f(x), y - x \rangle + \frac{1}{2} L \|y - x\|^2 \right\}. \qquad (4) $$

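We let $y_0 \triangleq T_Q(x_0)$ where $x_0$ is defined above. We recursively define three sequences of points: the current
iterate $x_k$, the corresponding $y_k = T_Q(x_k)$, and the points

$$ z_k \triangleq \mathop{\rm argmin}_{x \in Q} \left\{ \frac{L}{\sigma} d(x) + \sum_{i=0}^k \alpha_i \left[ f(x_i) + \langle \nabla f(x_i), x - x_i \rangle \right] \right\}, \qquad (5) $$

together with a step size sequence $\alpha_k \geq 0$ with $\alpha_0 \in (0, 1]$, so that

$$ x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k, \qquad y_{k+1} = T_Q(x_{k+1}), \qquad (6) $$

where $\tau_k = \alpha_{k+1}/A_{k+1}$ with $A_k = \sum_{i=0}^k \alpha_i$. We implicitly assume here that $Q$ is simple enough so that
the two subproblems defining $y_k$ and $z_k$ can be solved very efficiently. We have the following convergence
result.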

Theorem 2.1. Suppose $\alpha_k = (k + 1)/2$ with the iterates $x_k$, $y_k$ and $z_k$ defined in (5) and (6), then for any
$k \geq 0$ we have

$$ f(y_k) - f(x^*) \leq \frac{4 L d(x^*)}{\sigma (k + 1)^2}, $$

where $x^*$ is an optimal solution to problem (1).
Proof. See Nesterov [2005]. ■ 



If $\epsilon > 0$ is the target precision, Theorem 2.1 ensures that Algorithm 1 will converge to an $\epsilon$-accurate
solution in no more than

$$ \sqrt{\frac{8 L d(x^*)}{\sigma \epsilon}} \qquad (7) $$

iterations. In practice of course, $d(x^*)$ needs to be bounded a priori, and $L$ and $\sigma$ are often hard to evaluate.



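Algorithm 1 Smooth minimization.

Input: $x_0$, the prox center of the set $Q$.
1: for $k = 0, \ldots, N$ do
2: Compute $\nabla f(x_k)$.
3: Compute $y_k = T_Q(x_k)$.
4: Compute $z_k = \mathop{\rm argmin}_{x \in Q} \left\{ \frac{L}{\sigma} d(x) + \sum_{i=0}^k \alpha_i [f(x_i) + \langle \nabla f(x_i), x - x_i \rangle] \right\}$.
5: Set $x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k$.
6: end for
Output: $x_N, y_N \in Q$.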
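As a rough illustration of Algorithm 1, the sketch below specializes it to one simple case that is an assumption of this example, not part of the text: $Q$ is the Euclidean unit ball, the norm is Euclidean, and $d(x) = \|x\|_2^2/2$ (so $\sigma = 1$ and $x_0 = 0$). Under these assumptions both subproblems reduce to projections onto the ball. The helper names and the quadratic test problem are hypothetical.

```python
import numpy as np

def project_ball(v, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    nrm = np.linalg.norm(v)
    return v if nrm <= radius else v * (radius / nrm)

def smooth_minimization(grad, L, n, num_iters):
    """Sketch of Algorithm 1 for Q = Euclidean unit ball, d(x) = ||x||^2 / 2, sigma = 1."""
    x = np.zeros(n)              # x_0: prox center of Q
    s = np.zeros(n)              # running sum of alpha_i * grad f(x_i)
    y = x.copy()
    for k in range(num_iters + 1):
        g = grad(x)                          # step 2: gradient at x_k
        y = project_ball(x - g / L)          # step 3: y_k = T_Q(x_k)
        s += 0.5 * (k + 1) * g               # alpha_k = (k + 1) / 2
        z = project_ball(-s / L)             # step 4: argmin of (L/2)||x||^2 + <s, x> over Q
        tau = 2.0 / (k + 3)                  # tau_k = alpha_{k+1} / A_{k+1}
        x = tau * z + (1.0 - tau) * y        # step 5: x_{k+1}
    return y

# Hypothetical test problem: f(x) = 0.5 x^T H x - b^T x, whose minimizer lies inside the ball.
H = np.diag([1.0, 10.0])
b = np.array([0.5, 1.0])
f = lambda x: 0.5 * x @ H @ x - b @ x
x_star = np.linalg.solve(H, b)
N = 200
y_N = smooth_minimization(lambda x: H @ x - b, L=10.0, n=2, num_iters=N)
gap = f(y_N) - f(x_star)
# Bound of Theorem 2.1 with L = 10, sigma = 1, d(x*) = ||x*||^2 / 2.
bound = 4 * 10.0 * (0.5 * x_star @ x_star) / (N + 1) ** 2
```

Here `gap` can be compared against `bound` to check the rate of Theorem 2.1 on this particular instance; a different set $Q$ or prox function would require different solvers for steps 3 and 4.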



While most of the parameters in Algorithm 1 are set explicitly, the norm $\|\cdot\|$ and the prox function $d(x)$
are chosen arbitrarily. In what follows, we will see that a natural choice for both makes the algorithm affine
invariant.
3. Affine invariant implementation



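We can define an affine change of coordinates $x = Ay$ where $A \in \mathbb{R}^{n \times n}$ is a nonsingular matrix, for
which the original optimization problem in (1) is transformed so that

$$
\begin{array}{ll}
\mbox{minimize} & f(x)\\
\mbox{subject to} & x \in Q,
\end{array}
\quad \mbox{becomes} \quad
\begin{array}{ll}
\mbox{minimize} & \hat f(y)\\
\mbox{subject to} & y \in \hat Q,
\end{array}
\qquad (8)
$$

in the variable $y \in \mathbb{R}^n$, where

$$ \hat f(y) \triangleq f(Ay) \quad \mbox{and} \quad \hat Q \triangleq A^{-1} Q. \qquad (9) $$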

Unless $A$ is pathologically ill-conditioned, both problems are equivalent and should have invariant complexity
bounds and iterations. In fact, the complexity analysis of Newton's method based on the self-concordance
argument developed in [Nesterov and Nemirovskii, 1994] produces affine invariant complexity bounds and
the iterates themselves are invariant. Here we will show how to choose the norm $\|\cdot\|$ and the prox function
$d(x)$ to get a similar behavior for Algorithm 1.
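To make the lack of invariance concrete, the following sketch computes the Euclidean gradient Lipschitz constant of a quadratic before and after a change of coordinates. The quadratic $f(x) = \frac{1}{2} x^T H x$, the matrix `A` and the helper name are assumptions chosen for illustration; the example looks only at the smoothness constant, not at the full bound.

```python
import numpy as np

def euclidean_lipschitz(H):
    """Lipschitz constant of the gradient of 0.5 x^T H x in the Euclidean norm."""
    return float(np.linalg.eigvalsh(H).max())

H = np.eye(2)                            # f(x) = 0.5 ||x||_2^2
A = np.diag([1.0, 10.0])                 # hypothetical change of coordinates x = A y
L_x = euclidean_lipschitz(H)             # constant for f
L_y = euclidean_lipschitz(A.T @ H @ A)   # constant for f(Ay), whose Hessian is A^T H A
print(L_x, L_y)
```

With a fixed Euclidean norm the constant changes by a factor of 100 on this instance, even though the two problems are equivalent.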

3.1. Choosing the norm. We start by a few classical results and definitions. Recall that the Minkowski 
gauge of a set Q is defined as follows. 

Definition 3.1. Let $Q \subset \mathbb{R}^n$ be a set containing zero. We define the Minkowski gauge of $Q$ as

$$ \gamma_Q(x) \triangleq \inf\{\lambda \geq 0 : x \in \lambda Q\}, $$

with $\gamma_Q(x) = \infty$ when no such $\lambda$ exists.

When $Q$ is a compact convex set, centrally symmetric with respect to the origin and with nonempty interior,
the Minkowski gauge defines a norm. We write this norm $\|\cdot\|_Q \triangleq \gamma_Q(\cdot)$. From now on, we will assume
that the set $Q$ is centrally symmetric, or use for example the set $Q - Q$ (in the Minkowski sense) for the gauge
when it is not (this can be improved, and extending these results to the nonsymmetric case is a classical
topic in functional analysis). Note that any affine transform of a centrally symmetric convex set remains
centrally symmetric. The following simple result shows why $\|\cdot\|_Q$ is potentially a good choice of norm for
Algorithm 1.
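For a symmetric polytope the gauge has a closed form, which the sketch below uses. The representation $Q = \{x : |Gx| \leq 1 \mbox{ componentwise}\}$ and the function name `gauge_polytope` are assumptions made for this example; $G$ must have full column rank for $Q$ to be compact.

```python
import numpy as np

def gauge_polytope(G, x):
    """Minkowski gauge of Q = {x : |G x| <= 1 componentwise} evaluated at x."""
    return float(np.max(np.abs(G @ x)))

# Q = unit l_infinity ball: G is the identity and the gauge is the l_infinity norm.
x = np.array([0.5, -2.0, 1.0])
g_inf = gauge_polytope(np.eye(3), x)

# Q = unit l_1 ball in R^2, written as |x1 + x2| <= 1 and |x1 - x2| <= 1.
G1 = np.array([[1.0, 1.0], [1.0, -1.0]])
u = np.array([0.3, -0.9])
g_one = gauge_polytope(G1, u)
```

In both cases the gauge coincides with the familiar norm whose unit ball is $Q$, which is the sense in which the norm is read off the feasible set.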

Lemma 3.2. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior and let
$A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ has Lipschitz continuous gradient with respect to the norm
$\|\cdot\|_Q$ with constant $L > 0$, i.e.

$$ f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} L \|y - x\|_Q^2, \quad x, y \in Q, $$

if and only if the function $f(Aw)$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with
the same constant $L$.



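Proof. Let $y = Az$ and $x = Aw$, then

$$ f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} L \|y - x\|_Q^2, \quad x, y \in Q, $$

is equivalent to

$$ f(Az) \leq f(Aw) + \langle A^{-T} \nabla_w f(Aw), Az - Aw \rangle + \frac{1}{2} L \|Az - Aw\|_Q^2, \quad z, w \in A^{-1}Q, $$

and, using the fact that $\|Ax\|_Q = \|x\|_{A^{-1}Q}$, this is also

$$ f(Az) \leq f(Aw) + \langle \nabla_w f(Aw), A^{-1}(Az - Aw) \rangle + \frac{1}{2} L \|z - w\|_{A^{-1}Q}^2, \quad z, w \in A^{-1}Q, $$

hence the desired result. ■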
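The two facts used in this proof can be spelled out as follows. This is a short worked derivation from Definition 3.1 and the chain rule, with $\nabla_w$ denoting the gradient of $w \mapsto f(Aw)$:

```latex
\begin{align*}
\|Ax\|_Q &= \inf\{\lambda \geq 0 : Ax \in \lambda Q\}
          = \inf\{\lambda \geq 0 : x \in \lambda A^{-1}Q\}
          = \|x\|_{A^{-1}Q}, \\
\nabla_w f(Aw) &= A^T \nabla f(x)\big|_{x = Aw}
\quad\Longrightarrow\quad
\nabla f(x) = A^{-T} \nabla_w f(Aw).
\end{align*}
```

The first line relies on $A$ being nonsingular, so that $Ax \in \lambda Q$ holds exactly when $x \in \lambda A^{-1}Q$.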

An almost identical argument shows the following analogous result for the property of strong convexity
with respect to the norm $\|\cdot\|_Q$ and affine changes of coordinates.
Lemma 3.3. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior and let
$A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ is strongly convex with respect to the norm $\|\cdot\|_Q$ with
parameter $\sigma > 0$, i.e.

$$ f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} \sigma \|y - x\|_Q^2, \quad x, y \in Q, $$

if and only if the function $f(Ax)$ is strongly convex with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same
parameter $\sigma$.

We now turn our attention to the choice of prox function in Algorithm 1. 

3.2. Choosing the prox. Choosing the norm as $\|\cdot\|_Q$ allows us to define a norm without introducing
an arbitrary geometry in the algorithm, since the norm is extracted directly from the problem definition.
When $Q$ is smooth, a similar reasoning allows us to choose the prox term in Algorithm 1, and we can set
$d(x) = \|x\|_Q^2/2$. The immediate impact of this choice is that the term $d(x^*)$ in (7) is bounded by one, by
construction. This choice has other natural benefits which are highlighted below. We first recall a result
showing that the conjugate of a squared norm is the squared dual norm.

Lemma 3.4. Let $\|\cdot\|$ be a norm and $\|\cdot\|^*$ its dual norm, then

$$ \frac{1}{2} \left( \|y\|^* \right)^2 = \sup_x \; y^T x - \frac{1}{2} \|x\|^2. $$

Proof. We recall the proof in [Boyd and Vandenberghe, 2004, Example 3.27] as it will prove useful in
what follows. By definition, $x^T y \leq \|y\|^* \|x\|$, hence

$$ y^T x - \frac{1}{2} \|x\|^2 \leq \|y\|^* \|x\| - \frac{1}{2} \|x\|^2 \leq \frac{1}{2} \left( \|y\|^* \right)^2, $$

because the second term is a quadratic function of $\|x\|$, with maximum $(\|y\|^*)^2/2$. This maximum is
attained by any $x$ such that $x^T y = \|y\|^* \|x\|$ (there must be one by construction of the dual norm), normalized
so $\|x\| = \|y\|^*$, which yields the desired result. ■
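The construction in this proof can be checked numerically in one concrete case. The sketch below assumes the $\ell_1$ norm (whose dual is the $\ell_\infty$ norm) and builds the maximizer described above: a vector aligned with $y$ and scaled so that $\|x\|_1 = \|y\|_\infty$. The function name and the test vector are hypothetical.

```python
import numpy as np

def conjugate_half_sq_l1(y):
    """Value of sup_x y^T x - 0.5 ||x||_1^2 and a maximizer, following the proof of Lemma 3.4."""
    j = int(np.argmax(np.abs(y)))
    dual = float(np.abs(y[j]))           # ||y||_inf, the dual norm of the l_1 norm
    x = np.zeros_like(y)
    x[j] = dual * np.sign(y[j])          # aligned with y, normalized so ||x||_1 = ||y||_inf
    value = float(y @ x - 0.5 * np.sum(np.abs(x)) ** 2)
    return value, x

y = np.array([1.0, -3.0, 2.0])
value, x_max = conjugate_half_sq_l1(y)

# Random trial points should never beat the constructed maximizer.
rng = np.random.default_rng(0)
trials = rng.standard_normal((1000, 3)) * 3.0
trial_values = trials @ y - 0.5 * np.sum(np.abs(trials), axis=1) ** 2
```

On this instance `value` equals $\|y\|_\infty^2/2$, as the lemma predicts.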

This last result (and its proof) shows that solving the prox mapping is equivalent to finding a vector
aligned with the gradient, with respect to the Minkowski norm $\|\cdot\|_Q$. We now recall another simple result
showing that the dual of the norm $\|\cdot\|_Q$ is given by $\|\cdot\|_{Q^\circ}$ where $Q^\circ$ is the polar of $Q$.

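Lemma 3.5. Let $Q$ be a centrally symmetric convex set with nonempty interior, then $\|\cdot\|_Q^* = \|\cdot\|_{Q^\circ}$.

Proof. We write

$$
\begin{array}{rcl}
\|x\|_{Q^\circ} & = & \inf\{\lambda \geq 0 : x \in \lambda Q^\circ\}\\
& = & \inf\{\lambda \geq 0 : x^T y \leq \lambda, \mbox{ for all } y \in Q\}\\
& = & \inf\left\{\lambda \geq 0 : \sup_{y \in Q} x^T y \leq \lambda\right\}\\
& = & \sup_{y \in Q} x^T y\\
& = & \|x\|_Q^*,
\end{array}
$$

which is the desired result. ■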
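As a small sanity check of this duality, the sketch below evaluates the support function $\sup_{y \in Q} x^T y$ when $Q$ is the $\ell_1$ ball, described by its vertices $\pm e_i$. The vertex representation and the function name `dual_gauge` are assumptions of this example.

```python
import numpy as np

def dual_gauge(vertices, x):
    """Dual norm of the gauge of Q = conv(vertices): the support function sup_{y in Q} x^T y."""
    return float(np.max(vertices @ x))

n = 3
V = np.vstack([np.eye(n), -np.eye(n)])   # vertices of the unit l_1 ball
x = np.array([0.5, -2.0, 1.0])
val = dual_gauge(V, x)                   # a linear function over a polytope peaks at a vertex
```

The result is the $\ell_\infty$ norm of `x`, which is the gauge of the polar of the $\ell_1$ ball.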



The last remaining issue to settle is the strong convexity of the squared Minkowski norm. Fortunately,
this too is a classical result in functional analysis, as a squared norm is strongly convex with respect to itself
if and only if its dual norm has a smoothness modulus of power 2.

However, this does not cover the case where the norm $\|\cdot\|_Q$ is not smooth. In that scenario, we need to
pick the norm based on $Q$ but find a smooth prox function not too different from $\|\cdot\|_Q$. This is exactly the
problem studied by Juditsky and Nemirovski [2008], who define the regularity of a Banach space $(E, \|\cdot\|_E)$
in terms of the smoothness of the best smooth approximation of the norm $\|\cdot\|_E$. We first recall a few
more definitions, and we will then show that the regularity constant defined by Juditsky and Nemirovski
[2008] produces an affine invariant bound on the term $d(x^*)/\sigma$ in the complexity of the smooth algorithm
in [Nesterov, 1983].

Definition 3.6. Suppose $\|\cdot\|_X$ and $\|\cdot\|_Y$ are two norms on a space $E$. The distortion $d(\|\cdot\|_X, \|\cdot\|_Y)$
between these two norms is equal to the smallest product $ab > 0$ such that

$$ \frac{1}{b} \|x\|_Y \leq \|x\|_X \leq a \|x\|_Y $$

over all $x \in E$.

Note that $\log d(\|\cdot\|_X, \|\cdot\|_Y)$ defines a metric on the set of all symmetric convex bodies in $\mathbb{R}^n$, called the
Banach-Mazur distance. We then recall the regularity definition in Juditsky and Nemirovski [2008].

Definition 3.7. The regularity constant of a Banach space $(E, \|\cdot\|)$ is the smallest constant $\Delta > 0$ for which
there exists a smooth norm $p(x)$ such that

(i) $p(x)^2/2$ has a Lipschitz continuous gradient with constant $\mu$ w.r.t. the norm $p(x)$, with $1 \leq \mu \leq \Delta$,

(ii) the norm $p(x)$ satisfies

$$ \|x\|^2 \leq p(x)^2 \leq \frac{\Delta}{\mu} \|x\|^2, \quad \mbox{for all } x \in E, \qquad (10) $$

hence $d(p(x), \|\cdot\|) \leq \sqrt{\Delta/\mu}$.
Note that in finite dimension, since all norms are equivalent to the Euclidean norm with distortion at most
$\sqrt{\dim E}$, we know that all finite dimensional Banach spaces are at least $(\dim E)$-regular. Furthermore, the
regularity constant is invariant with respect to an affine change of coordinates since both the distortion and
the smoothness bounds are. We are now ready to prove the main result of this section.

Proposition 3.8. Let $\epsilon > 0$ be the target precision, suppose that the function $f$ has a Lipschitz continuous
gradient with constant $L_Q$ with respect to the norm $\|\cdot\|_Q$ and that the space $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular,
then Algorithm 1 will produce an $\epsilon$-solution to problem (1) in at most

$$ \sqrt{\frac{8 L_Q \min\{D_Q/2, n\}}{\epsilon}} \qquad (11) $$

iterations. The constants $L_Q$ and $D_Q$ are affine invariant.

Proof. If $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular, then by Definition 3.7, there exists a norm $p(x)$ such that $p(x)^2/2$
has a Lipschitz continuous gradient with constant $\mu$ with respect to the norm $p(x)$, and [Juditsky and
Nemirovski, 2008, Prop. 3.2] shows by conjugacy that the prox function $d(x) \triangleq p^*(x)^2/2$ is strongly convex
with respect to the norm $p^*(x)$ with constant $1/\mu$. Now (10) means that

$$ \sqrt{\frac{\mu}{D_Q}} \, \|x\|_Q \leq p^*(x) \leq \|x\|_Q, \quad \mbox{for all } x \in Q, $$

since $\|\cdot\|^{**} = \|\cdot\|$, hence

$$
\begin{array}{rcl}
d(x + y) & \geq & d(x) + \langle \partial d(x), y \rangle + \frac{1}{2\mu} \, p^*(y)^2\\
& \geq & d(x) + \langle \partial d(x), y \rangle + \frac{1}{2 D_Q} \|y\|_Q^2,
\end{array}
$$

so $d(x)$ is strongly convex with respect to $\|\cdot\|_Q$ with constant $\sigma = 1/D_Q$, and using (10) as above

$$ \frac{d(x^*)}{\sigma} = \frac{p^*(x^*)^2 D_Q}{2} \leq \frac{\|x^*\|_Q^2 D_Q}{2} \leq \frac{D_Q}{2}, $$

by definition of $\|\cdot\|_Q$, if $x^*$ is an optimal (hence feasible) solution of problem (1). The bound in (11)
then follows from (7) and its affine invariance follows directly from affine invariance of the distortion and
Lemmas 3.2 and 3.3. ■
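The step from (10) to the two-sided bound on $p^*$ uses the fact that taking dual norms reverses inequalities. A short worked version is given below, under the assumption that (10) is applied to the dual space $(\mathbb{R}^n, \|\cdot\|_Q^*)$ with $\Delta = D_Q$:

```latex
\begin{align*}
&\|x\|_Q^* \;\leq\; p(x) \;\leq\; \sqrt{D_Q/\mu}\,\|x\|_Q^* \quad \mbox{for all } x
&& \mbox{(square root of (10))}\\
\Longrightarrow\quad
&\sqrt{\mu/D_Q}\,\|y\|_Q^{**} \;\leq\; p^*(y) \;\leq\; \|y\|_Q^{**} \quad \mbox{for all } y
&& \mbox{(if $N_1 \leq N_2$ then $N_1^* \geq N_2^*$, and $(cN)^* = N^*/c$)}\\
\Longrightarrow\quad
&\sqrt{\mu/D_Q}\,\|y\|_Q \;\leq\; p^*(y) \;\leq\; \|y\|_Q
&& \mbox{(since $\|\cdot\|^{**} = \|\cdot\|$)}.
\end{align*}
```

Squaring the left inequality gives $p^*(y)^2/(2\mu) \geq \|y\|_Q^2/(2 D_Q)$, which is the strong convexity constant $1/D_Q$ used in the proof.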



4. Examples 

To illustrate our results, consider the problem of minimizing a smooth convex function over the unit
simplex, written

$$
\begin{array}{ll}
\mbox{minimize} & f(x)\\
\mbox{subject to} & \mathbf{1}^T x \leq 1, \; x \geq 0,
\end{array}
\qquad (12)
$$

in the variable $x \in \mathbb{R}^n$. As discussed in [Juditsky et al., 2009, §3.3], choosing $\|\cdot\|_1$ as the norm and
$d(x) = \log n + \sum_{i=1}^n x_i \log x_i$ as the prox function, we have $\sigma = 1$ and $d(x^*) \leq \log n$, which means the
complexity of solving (12) using Algorithm 1 is bounded by

$$ \sqrt{\frac{8 L_1 \log n}{\epsilon}} \qquad (13) $$

where $L_1$ is the Lipschitz constant of $\nabla f$ with respect to the $\ell_1$ norm. This choice of norm and prox has a
double advantage here. First, the prox term $d(x^*)$ grows only as $\log n$ with the dimension. Second, the $\ell_\infty$
norm being the smallest among all $\ell_p$ norms, the smoothness bound $L_1$ is also minimal among all choices
of $\ell_p$ norms.

Let us now follow the construction of Section 3. The simplex $C = \{x \in \mathbb{R}^n : \mathbf{1}^T x \leq 1, x \geq 0\}$ is not
centrally symmetric, but we can symmetrize it as the $\ell_1$ ball. The Minkowski norm associated with that set
is then equal to the $\ell_1$-norm, so $\|\cdot\|_Q = \|\cdot\|_1$ here. The space $(\mathbb{R}^n, \|\cdot\|_\infty)$ is $2 \log n$ regular [Juditsky and
Nemirovski, 2008, Example 3.2] with the prox function chosen here as $\|\cdot\|_{(2 \log n)}^2/2$. Proposition 3.8 then
shows that the complexity bound we obtain using this procedure is identical to that in (13). A similar result
holds in the matrix case.
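One way to see why the $\ell_p$ norm with $p = 2 \log n$ is a good smooth stand-in for $\ell_\infty$ is that its distortion with respect to $\ell_\infty$ does not grow with the dimension: $\|x\|_\infty \leq \|x\|_p \leq n^{1/p} \|x\|_\infty$ and $n^{1/(2 \log n)} = \sqrt{e}$. The sketch below checks this on a random vector; the dimension, seed and helper name are arbitrary choices made for illustration.

```python
import math
import numpy as np

def lp_norm(x, p):
    """The l_p norm of x for a finite exponent p >= 1."""
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

n = 1000
p = 2.0 * math.log(n)                # exponent used for the smooth surrogate norm
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
ratio = lp_norm(x, p) / float(np.max(np.abs(x)))
worst_case = n ** (1.0 / p)          # attained at x = (1, ..., 1); equals sqrt(e) for any n
```

The ratio always lies between 1 and `worst_case`, independently of `n`, which is consistent with the logarithmic dependence on the dimension in the bound above.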



5. Conclusion 

From a practical point of view, the results above offer guidance in the choice of a prox function
depending on the geometry of the feasible set $Q$. On the theoretical side, these results provide affine invariant
descriptions of the complexity of the feasible set and of the smoothness of the objective function, written
in terms of the regularity constant of the polar of the feasible set and the Lipschitz constant of $\nabla f$ with
respect to the Minkowski norm. However, while we show that it is possible to formulate an affine invariant
implementation of the optimal algorithm in [Nesterov, 1983], we do not yet show that this is always a good
idea. In particular, given our choice of norm, the constants $L_Q$ and $D_Q$ are both affine invariant, with $L_Q$
optimal by construction and our choice of prox function minimizing $D_Q$ over all smooth square norms, but
this does not mean that our choice of norm (Minkowski) minimizes the product $L_Q \min\{D_Q/2, n\}$, hence
that we achieve the best possible bound for the complexity of the smooth algorithm in [Nesterov, 1983].